2,868 research outputs found

    ViZDoom: DRQN with Prioritized Experience Replay, Double-Q Learning, & Snapshot Ensembling

    ViZDoom is a robust, first-person shooter reinforcement learning environment, characterized by a significant degree of latent state information. In this paper, double-Q learning and prioritized experience replay methods are tested on a ViZDoom combat scenario using a competitive deep recurrent Q-network (DRQN) architecture. In addition, an ensembling technique known as snapshot ensembling is employed with a specific annealed learning rate to observe differences in ensembling efficacy under these two methods. Annealed learning rates are important to the training of deep neural networks in general, as they shake up the status quo and counter a model's tendency to settle in local optima. While both variants outperform the game's built-in AI agents, the known stabilizing effect of double-Q learning is illustrated, and prioritized experience replay again proves useful by delivering gains early in agent training, with the caveat that value overestimation is accelerated in this case. In addition, some unique behaviors are observed to develop for the prioritized experience replay (PER) and double-Q (DDQ) variants, and snapshot ensembling of both PER and DDQ proves a valuable method for improving the performance of the ViZDoom Marine. Comment: 9 pages, 5 figures
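
    The three ingredients in the title are standard and easy to sketch. Below is a minimal, illustrative sketch (not the paper's code): the double-Q target, the proportional PER priority, and the cyclic cosine learning-rate schedule typically used for snapshot ensembling; q_online and q_target stand in for arbitrary Q-network callables.

    import math
    import numpy as np

    def double_q_targets(q_online, q_target, rewards, next_states, dones, gamma=0.99):
        """Double-Q learning: the online net selects the action, the target net evaluates it."""
        best_actions = np.argmax(q_online(next_states), axis=1)
        q_next = q_target(next_states)[np.arange(len(best_actions)), best_actions]
        return rewards + gamma * (1.0 - dones) * q_next

    def per_priorities(td_errors, alpha=0.6, eps=1e-6):
        """Proportional prioritized replay: priority grows with the magnitude of the TD error."""
        return (np.abs(td_errors) + eps) ** alpha

    def snapshot_lr(step, steps_per_cycle, lr_max=1e-3):
        """Cyclic cosine annealing; a model snapshot is saved at the end of each cycle."""
        t = (step % steps_per_cycle) / steps_per_cycle
        return 0.5 * lr_max * (1.0 + math.cos(math.pi * t))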

    Deep Ordinal Reinforcement Learning

    Reinforcement learning usually makes use of numerical rewards, which have nice properties but also come with drawbacks and difficulties. Using rewards on an ordinal scale (ordinal rewards) is an alternative to numerical rewards that has received more attention in recent years. In this paper, a general approach to adapting reinforcement learning problems to the use of ordinal rewards is presented and motivated. We show how to convert common reinforcement learning algorithms to an ordinal variation, using Q-learning as an example, and introduce Ordinal Deep Q-Networks, which adapt deep reinforcement learning to ordinal rewards. Additionally, we run evaluations on problems provided by the OpenAI Gym framework, showing that our ordinal variants perform comparably to their numerical counterparts on a number of problems. We also provide initial evidence that the ordinal variant can produce better results for problems with less engineered, simpler-to-design reward signals. Comment: replaced figures for better visibility, added github repository, more details about source of experimental results, updated target value calculation for standard and ordinal Deep Q-Networks
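
    As a rough illustration of the ordinal idea (our simplification, not the paper's exact target calculation): each action keeps a count vector over K ordinal reward ranks, and actions are compared by the probability that a rank sampled for one action beats a rank sampled for the other.

    import numpy as np

    def rank_distribution(counts):
        """Normalize per-action counts over K ordinal reward ranks into a distribution."""
        counts = np.asarray(counts, dtype=float)
        return counts / counts.sum()

    def win_probability(p_a, p_b):
        """P(rank_a > rank_b) + 0.5 * P(rank_a == rank_b) under independent sampling."""
        p_a, p_b = np.asarray(p_a), np.asarray(p_b)
        cdf_b = np.cumsum(p_b)
        strictly_better = np.sum(p_a[1:] * cdf_b[:-1])   # a lands on a higher rank than b
        tie = np.sum(p_a * p_b)
        return strictly_better + 0.5 * tie

    An ordinal greedy policy would then pick the action with the highest win probability against the alternatives.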

    Understanding structure of concurrent actions

    Whereas most work in reinforcement learning (RL) ignores the structure of, or relationships between, actions, in this paper we show that exploiting structure in the action space can improve sample efficiency during exploration. To show this, we focus on concurrent action spaces, in which the RL agent selects multiple actions per timestep. Concurrent action spaces are challenging to learn in, especially when the number of actions is large, as this can lead to a combinatorial explosion of the action space. This paper proposes two methods: the first uses implicit structure to perform high-level action elimination using task-invariant actions; the second looks for more explicit structure in the form of action clusters. Both methods are context-free, relying only on an analysis of the action space, and show a significant improvement in policy convergence times.
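
    A minimal sketch of the two directions just described, with hypothetical names (not the authors' implementation): eliminating task-invariant actions by masking them out before the greedy choice, and clustering actions by the similarity of their observed effects.

    import numpy as np
    from sklearn.cluster import KMeans

    def masked_greedy_action(q_values, valid_mask):
        """Eliminated actions get -inf so they can never be selected."""
        return int(np.argmax(np.where(valid_mask, q_values, -np.inf)))

    def cluster_actions(effect_vectors, n_clusters):
        """Group actions whose observed effects on the state are similar."""
        return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(effect_vectors)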

    The Dreaming Variational Autoencoder for Reinforcement Learning Environments

    Reinforcement learning has shown great potential in generalizing over raw sensory data using only a single neural network for value optimization. Several challenges in current state-of-the-art reinforcement learning algorithms prevent them from converging towards the global optimum. The solution to these problems likely lies in short- and long-term planning, exploration, and memory management for reinforcement learning algorithms. Games are often used to benchmark reinforcement learning algorithms, as they provide a flexible, reproducible, and easy-to-control environment. Nevertheless, few games feature a state space in which progress in exploration, memory, and planning is easily perceived. This paper presents the Dreaming Variational Autoencoder (DVAE), a neural-network-based generative modeling architecture for exploration in environments with sparse feedback. We further present Deep Maze, a novel and flexible maze engine that challenges DVAE with partially and fully observable state spaces, long-horizon tasks, and deterministic and stochastic problems. We show initial findings and encourage further work on reinforcement learning driven by generative exploration. Comment: Best Student Paper Award, Proceedings of the 38th SGAI International Conference on Artificial Intelligence, Cambridge, UK, 2018, Artificial Intelligence XXXV, 201
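
    A minimal sketch of the general idea, assuming a conditional VAE over (state, action) pairs that reconstructs the next state so that rollouts can be "dreamed" without touching the real environment; the layer sizes and exact conditioning are our assumptions, not the DVAE architecture from the paper.

    import torch
    import torch.nn as nn

    class DreamModel(nn.Module):
        def __init__(self, state_dim, action_dim, latent_dim=32):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(state_dim + action_dim, 128), nn.ReLU())
            self.mu = nn.Linear(128, latent_dim)
            self.log_var = nn.Linear(128, latent_dim)
            self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                         nn.Linear(128, state_dim))

        def forward(self, state, action):
            h = self.encoder(torch.cat([state, action], dim=-1))
            mu, log_var = self.mu(h), self.log_var(h)
            z = mu + torch.randn_like(mu) * torch.exp(0.5 * log_var)  # reparameterization trick
            return self.decoder(z), mu, log_var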

    Geometry of Policy Improvement

    We investigate the geometry of optimal memoryless, time-independent decision making in relation to the amount of information that the acting agent has about the state of the system. We show that the expected long-term reward, discounted or per time step, is maximized by policies that randomize among at most k actions whenever at most k world states are consistent with the agent's observation. Moreover, we show that the expected reward per time step can be studied in terms of the expected discounted reward. Our main tool is a geometric version of the policy improvement lemma, which identifies a polyhedral cone of policy changes in which the state value function increases for all states. Comment: 8 pages
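
    For reference, the standard objects behind these statements, written in common notation rather than the paper's own: the discounted and average-reward criteria for a policy $\pi$,
    $$V^{\pi}(s) = \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t} r_{t+1} \,\Big|\, s_0 = s\Big], \qquad R_{\mathrm{avg}}(\pi) = \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}_{\pi}\Big[\sum_{t=1}^{T} r_t\Big],$$
    and the classical policy improvement lemma: if $\sum_a \pi'(a \mid s)\, Q^{\pi}(s,a) \ge V^{\pi}(s)$ for every state $s$, then $V^{\pi'}(s) \ge V^{\pi}(s)$ for every $s$. The geometric version described in the abstract replaces this pointwise check with a polyhedral cone of admissible policy changes.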

    Crawling in Rogue's dungeons with (partitioned) A3C

    Rogue is a famous dungeon-crawling video game of the 1980s, the ancestor of its genre. Rogue-like games are known for requiring the exploration of partially observable, always-different, randomly generated labyrinths, which prevents any form of level replay. As such, they serve as a very natural and challenging task for reinforcement learning, requiring the acquisition of complex, non-reactive behaviors involving memory and planning. In this article we show how, by exploiting a version of A3C partitioned over different situations, the agent is able to reach the stairs and descend to the next level in 98% of cases. Comment: Accepted at the Fourth International Conference on Machine Learning, Optimization, and Data Science (LOD 2018)
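
    One simple, hypothetical way to realize the "partitioned" A3C the abstract mentions (the names here are ours): route each observation to a situation-specific policy head.

    def act(observation, policies, classify_situation):
        """policies: dict mapping a situation label to a policy object exposing sample_action()."""
        situation = classify_situation(observation)   # e.g. "stairs_visible" vs. "exploring"
        return policies[situation].sample_action(observation)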

    Improving Search through A3C Reinforcement Learning based Conversational Agent

    We develop a reinforcement-learning-based search assistant that can guide users through a set of actions and a sequence of interactions to help them realize their intent. Our approach caters to subjective search, where the user is seeking digital assets such as images, which is fundamentally different from tasks with objective and limited search modalities. Labeled conversational data is generally not available for such search tasks, and training the agent through human interactions can be time consuming. We propose a stochastic virtual user which impersonates a real user and can be used to sample user behavior efficiently to train the agent, which accelerates its bootstrapping. We develop an A3C-based context-preserving architecture that enables the agent to provide contextual assistance to the user. We compare the A3C agent with Q-learning and evaluate its performance on the average rewards and state values it obtains with the virtual user in validation episodes. Our experiments show that the agent learns to achieve higher rewards and reach better states. Comment: 17 pages, 7 figures
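
    Purely illustrative (the names and structure are assumptions, not the paper's simulator): a stochastic virtual user can be as simple as a conditional sampler over canned responses, giving cheap interaction episodes for bootstrapping the agent.

    import random

    class VirtualUser:
        def __init__(self, behavior_model):
            # behavior_model: dict mapping agent action -> list of (user_response, probability)
            self.behavior_model = behavior_model

        def respond(self, agent_action):
            responses, probs = zip(*self.behavior_model[agent_action])
            return random.choices(responses, weights=probs, k=1)[0]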

    Multi-agent Hierarchical Reinforcement Learning with Dynamic Termination

    In a multi-agent system, an agent's optimal policy will typically depend on the policies chosen by others. A key issue in multi-agent systems research is therefore predicting the behaviours of others and responding promptly to changes in such behaviours. One obvious possibility is for each agent to broadcast its current intention, for example the currently executed option in a hierarchical reinforcement learning framework. However, this approach makes agents inflexible if options have an extended duration and the setting is dynamic. While adjusting the executed option at each step improves flexibility from a single-agent perspective, frequent changes in options can induce inconsistency between an agent's actual behaviour and its broadcast intention. To balance flexibility and predictability, we propose a dynamic termination Bellman equation that allows agents to flexibly terminate their options. We evaluate our model empirically on a set of multi-agent pursuit and taxi tasks and show that our agents learn to adapt flexibly across scenarios that require different termination behaviours. Comment: PRICAI 201
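
    As a hedged sketch only (the notation is ours, not necessarily the paper's formulation), a dynamic termination Bellman equation can be written as a choice, at every step, between continuing the current option $o$ and terminating it to re-select, paying a small switching penalty $\delta$ that discourages erratic changes of broadcast intention:
    $$Q(s, o) = \mathbb{E}\Big[\, r + \gamma \max\Big( Q(s', o),\; \max_{o'} Q(s', o') - \delta \Big) \Big].$$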

    Hi-Val: Iterative Learning of Hierarchical Value Functions for Policy Generation

    Task decomposition is effective in many applications where the global complexity of a problem makes planning and decision-making too demanding. This is true, for example, in high-dimensional robotics domains, where (1) unpredictability and modeling limitations typically prevent the manual specification of robust behaviors, and (2) learning an action policy is challenging due to the curse of dimensionality. In this work, we borrow the concept of Hierarchical Task Networks (HTNs) to decompose the learning procedure, and we exploit Upper Confidence Tree (UCT) search to introduce HOP, a novel iterative algorithm for hierarchical optimistic planning with learned value functions. To obtain better generalization and generate policies, HOP simultaneously learns and uses action values, which are used to formalize constraints within the search space and to reduce the dimensionality of the problem. We evaluate our algorithm both on a fetching task using a simulated 7-DOF KUKA lightweight arm and on a pick-and-delivery task with a Pioneer robot.
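
    For reference, the standard UCT selection rule that such optimistic planning builds on, in a minimal form (not the paper's full HOP algorithm): choose the child maximizing its value estimate plus an exploration bonus that shrinks with visit count.

    import math

    def uct_select(children, exploration_c=1.4):
        """children: list of dicts with 'value_sum' and 'visits' for each candidate action."""
        total_visits = sum(c["visits"] for c in children) or 1
        def score(c):
            if c["visits"] == 0:
                return float("inf")            # always try unvisited children first
            exploit = c["value_sum"] / c["visits"]
            explore = exploration_c * math.sqrt(math.log(total_visits) / c["visits"])
            return exploit + explore
        return max(children, key=score)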

    ContextVP: Fully Context-Aware Video Prediction

    Video prediction models based on convolutional networks, recurrent networks, and their combinations often result in blurry predictions. We identify an important contributing factor for imprecise predictions that has not been studied adequately in the literature: blind spots, i.e., lack of access to all relevant past information for accurately predicting the future. To address this issue, we introduce a fully context-aware architecture that captures the entire available past context for each pixel using Parallel Multi-Dimensional LSTM units and aggregates it using blending units. Our model outperforms a strong baseline network of 20 recurrent convolutional layers and yields state-of-the-art performance for next-step prediction on three challenging real-world video datasets: Human 3.6M, Caltech Pedestrian, and UCF-101. Moreover, it does so with fewer parameters than several recently proposed models, and does not rely on deep convolutional networks, multi-scale architectures, separation of background and foreground modeling, motion flow learning, or adversarial training. These results highlight that full awareness of past context is of crucial importance for video prediction. Comment: 19 pages. ECCV 2018 oral presentation. Project webpage is at https://wonmin-byeon.github.io/publication/2018-ecc
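
    A hedged sketch of what a "blending unit" can look like (the channel counts, the 1x1-convolution weighting, and the four-direction assumption are ours, not the paper's exact design): per-pixel softmax weights combine the outputs of several directional recurrent passes.

    import torch
    import torch.nn as nn

    class BlendingUnit(nn.Module):
        def __init__(self, channels, n_directions=4):
            super().__init__()
            self.weights = nn.Conv2d(n_directions * channels, n_directions, kernel_size=1)

        def forward(self, direction_features):
            # direction_features: list of (B, C, H, W) tensors, one per scan direction
            stacked = torch.cat(direction_features, dim=1)     # (B, D*C, H, W)
            w = torch.softmax(self.weights(stacked), dim=1)    # per-pixel weights over directions
            return sum(w[:, d:d + 1] * f for d, f in enumerate(direction_features))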